# Multimodal Document Retrieval

ColNomic Embed Multimodal 7B

ColNomic Embed Multimodal 7B is a state-of-the-art multi-vector multimodal embedding model that excels at visual document retrieval, with multilingual support and unified text-image encoding.

Multimodal Fusion, Supports Multiple Languages

ReT OpenCLIP ViT G 14

ReT is a retrieval method that supports multimodal queries and documents, achieving fine-grained retrieval by integrating multi-level representations from its visual and textual backbone networks.

Multimodal Fusion

ReT OpenCLIP ViT H 14

ReT is a retrieval method that supports multimodal queries and documents, achieving fine-grained retrieval by integrating multi-level representations from its visual and textual backbone networks.

Multimodal Fusion

ReT CLIP ViT L 14

ReT is a retrieval method that supports multimodal queries and documents, achieving fine-grained retrieval by integrating multi-level representations from its visual and textual backbone networks.

Multimodal Fusion

ColQwen2.5 3B Multilingual V1.0

A multilingual visual retrieval model built on Qwen2.5-VL-3B-Instruct with the ColBERT strategy. It supports dynamic input image resolution and multilingual document retrieval.

Text-to-Image, Supports Multiple Languages

ColQwen2.5 3B Multilingual V1.0 Merged

A multilingual visual retrieval model built on Qwen2.5-VL-3B-Instruct with the ColBERT strategy. It supports dynamic input image resolution and generates ColBERT-style multi-vector representations of text and images.

Transformers, Supports Multiple Languages

ColQwen2.5 7B Multilingual V1.0

A multilingual visual retrieval model built on Qwen2.5-VL-7B-Instruct with the ColBERT strategy, listed as ranking first on the ViDoRe benchmark.

Text-to-Image, Supports Multiple Languages

ColQwen2.5 3B Multilingual V1.0

A multilingual visual retriever built on Qwen2.5-VL-3B-Instruct with the ColBERT strategy, listed as performing strongly on the ViDoRe benchmark.

Text-to-Image, Supports Multiple Languages

ColQwen2.5 V0.1

A visual retrieval model built on Qwen2.5-VL-3B-Instruct with the ColBERT strategy. It generates multi-vector representations of text and images for efficient document retrieval.

Safetensors, English

ColQwen2 7B V1.0

A visual retrieval model built on Qwen2-VL-7B-Instruct with the ColBERT strategy, focused on efficiently indexing documents from their visual features.

Text-to-Image, Supports Multiple Languages

ColQwen2 7B V1.0

A visual retrieval model built on Qwen2-VL-7B-Instruct with the ColBERT strategy, producing multi-vector representations of text and images.

Text-to-Image, English

ColPali V1.3 HF

ColPali is a vision-language model that extends PaliGemma-3B. It indexes documents efficiently from their visual features and generates ColBERT-style multi-vector representations.

Transformers, English

VisRAG is a retrieval-augmented generation (RAG) system built on vision-language models (VLMs). It embeds documents directly as images, avoiding the information loss caused by traditional text parsing.

Text-to-Image, English
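
Several entries above mention ColBERT-style multi-vector representations. As a rough illustration of how such representations are typically compared, the sketch below scores a query against document pages with late interaction (often called MaxSim): each query vector is matched to its most similar document vector, and the maxima are summed. This is a minimal pure-Python sketch with toy two-dimensional vectors, not the implementation of any listed model; the function names `maxsim_score` and `rank_documents` and the page IDs are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for every query vector, keep the similarity of
    its best-matching document vector, then sum those maxima."""
    if not query_vecs or not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(
    query_vecs: Sequence[Vector], docs: Dict[str, Sequence[Vector]]
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from best to worst match."""
    scored = [(doc_id, maxsim_score(query_vecs, vecs)) for doc_id, vecs in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed for illustration): one vector per query token,
# one vector per image patch of each document page.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.6, 0.0], [0.0, 0.2]],
}
ranking = rank_documents(query, pages)
```

Here `ranking` lists `page_a` first, because both query vectors find an exact match among its patch vectors. In a real system the vectors would come from the model's encoder and have far more dimensions.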


© 2025 AIbase